Distributed training #8

Open · wants to merge 6 commits into main
Conversation

@mjconnor (Collaborator)

No description provided.

@mjconnor requested a review from @akshay-anyscale on July 12, 2023 at 22:00.
@akshay-anyscale (Contributor)

LGTM - @matthewdeng to take a look

@mjconnor (Collaborator, Author)

@matthewdeng sorry, meant to request you. Please review and merge!


To run:
```bash
python pytorch.py
```

Contributor: Suggest having a slightly more descriptive name here, e.g. `train_torch_model.py`.

### Monitor
After launching the script, you can look at the Ray dashboard. It can be accessed from the Workspace home page and enables users to track things like CPU/GPU utilization, GPU memory usage, remote task statuses, and more!

![Dash](https://github.com/anyscale/templates/releases/download/media/workspacedash.png)
Contributor: This image is highlighting VSCode 😅

[See here for more extensive documentation on the dashboard.](https://docs.ray.io/en/latest/ray-observability/getting-started.html)

### Model Saving
The model will be saved in the Anyscale Artifact Store, which is automatically set up and configured with your Anyscale deployment.
Contributor: Can we point to Anyscale documentation here? I feel like this is introducing a new concept for something that should be simpler (a cloud storage bucket).

```bash
gsutil ls $ANYSCALE_ARTIFACT_STORAGE
```
Authentication is automatcially handled by default.
Contributor: Suggested change (typo fix):
- Authentication is automatcially handled by default.
+ Authentication is automatically handled by default.
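
Returning to the Model Saving section: for completeness, here is a hedged sketch (not code from this PR) of writing a trained model into that bucket, assuming `$ANYSCALE_ARTIFACT_STORAGE` holds an object-store URI and an fsspec backend (e.g. gcsfs or s3fs) is available in the cluster environment:

```python
# Hedged sketch: save a PyTorch model locally, then copy it into the bucket
# referenced by $ANYSCALE_ARTIFACT_STORAGE. The model and file names are illustrative.
import os

import fsspec
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the trained model
local_path = "model.pt"
torch.save(model.state_dict(), local_path)

artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]  # e.g. gs://... or s3://...
fs, remote_root = fsspec.core.url_to_fs(artifact_uri)
fs.put(local_path, f"{remote_root}/model.pt")
```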

### Submit as Anyscale Production Job
From within your Anyscale Workspace, you can run your script as an Anyscale Job. This is useful if you want to run a long-running job in production. Each Anyscale Job spins up its own cluster (with the same compute config and cluster environment as the Workspace) and runs the script. The Anyscale Job automatically retries in the event of failure and provides monitoring via the Ray Dashboard and Grafana.

To submit as a Production Job you can run:
Contributor: Could we have consistency in the naming? We use "Anyscale Production Job", "Anyscale Job", and "Production Job" here - it may not be obvious to the user that all three of these are meant to be the same thing 😄

for _ in range(epochs):
    train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss = validate_epoch(test_dataloader, model, loss_fn)
    session.report(dict(loss=loss))

Contributor: We should save a checkpoint here 😄
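
A minimal sketch of what the suggested checkpointing could look like, assuming the script uses the Ray AIR `session`/`Checkpoint` API that the `session.report` call above implies; the surrounding names come from the quoted snippet, and this is illustrative rather than the PR's actual change:

```python
# Hedged sketch: report a checkpoint together with the validation loss so Ray Train
# persists the model state every epoch. Assumes ray.air.session / ray.air.checkpoint.
from ray.air import session
from ray.air.checkpoint import Checkpoint

for _ in range(epochs):
    train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss = validate_epoch(test_dataloader, model, loss_fn)
    checkpoint = Checkpoint.from_dict(
        {"model_state_dict": model.state_dict()}  # model is the torch.nn.Module being trained
    )
    session.report(dict(loss=loss), checkpoint=checkpoint)
```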

Comment on lines +147 to +152
parser.add_argument(
    "--smoke-test",
    action="store_true",
    default=False,
    help="Finish quickly for testing.",
)

Contributor: This isn't used.
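
One hypothetical way the flag could actually be used (not code from the PR) is to shrink the run when `--smoke-test` is passed:

```python
# Hypothetical sketch of wiring up --smoke-test: run a single short epoch on a
# single worker so the script can be exercised quickly in CI.
args = parser.parse_args()

epochs = 1 if args.smoke_test else 10        # 10 is an illustrative full-run value
num_workers = 1 if args.smoke_test else 2    # 2 matches the distributed-training example
```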

Comment on lines +7 to +8
min_workers: 1
max_workers: 3
Contributor: I don't think these values really make sense with the current training script. If we are showing distributed training with 2 GPUs, I think we should either have min_workers be 2 (to make the script run immediately) or 1 (if we want to show autoscaling).
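
For context, a hedged sketch of the relationship the reviewer is describing, assuming the script launches Ray Train's `TorchTrainer` with 2 GPU workers (the function name is illustrative): with `num_workers=2`, the cluster needs enough workers available to satisfy that request before training starts.

```python
# Hedged sketch: a TorchTrainer that requests 2 GPU workers. The compute config's
# min_workers/max_workers must be able to satisfy this scaling request.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # placeholder for the PR's per-worker loop (train_epoch / validate_epoch / session.report)
    pass


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```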

Comment on lines +144 to +146
parser.add_argument(
    "--use-gpu", action="store_true", default=True, help="Enables GPU training"
)

Contributor: I think this makes it so that this value is always true?
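
A possible fix (a sketch, not the author's change): either default the flag to `False`, or use a paired on/off flag via `argparse.BooleanOptionalAction` (Python 3.9+) so GPU training can actually be disabled:

```python
import argparse

parser = argparse.ArgumentParser()

# Option A: keep store_true but default to False, so passing --use-gpu actually toggles it.
# parser.add_argument("--use-gpu", action="store_true", default=False,
#                     help="Enables GPU training")

# Option B (Python 3.9+): generates both --use-gpu and --no-use-gpu, keeping GPU
# training on by default while still allowing it to be turned off.
parser.add_argument(
    "--use-gpu",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Enables GPU training",
)

args = parser.parse_args(["--no-use-gpu"])  # example invocation
print(args.use_gpu)  # -> False
```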
